policy parameter
The Ladder in Chaos: Improving Policy Learning by Harnessing the Parameter Evolving Path in A Low-dimensional Space
Tang, Hongyao
Deep Reinforcement Learning (DRL) is still far from well understood, even though its great potential has been demonstrated by achievements on a range of practical problems [Badia et al., 2020, Shah et al., 2022, Fawzi et al., 2022, Degrave et al., 2022, OpenAI, 2022]. Consistent efforts have been made to gain a better understanding of the learning dynamics of RL agents.
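As an illustration of examining a policy's parameter evolving path in a low-dimensional space, one common recipe is to project flattened parameter checkpoints onto their top principal components. The NumPy sketch below uses random stand-in data and is our illustration of that generic recipe, not the paper's actual procedure.

```python
import numpy as np

# Illustrative sketch (not the paper's code): examine how a policy's
# parameter vector evolves during training by projecting saved
# checkpoints onto a low-dimensional subspace found with PCA.
rng = np.random.default_rng(0)
# Stand-in for 200 flattened policy-parameter snapshots of dimension 10,000.
checkpoints = rng.normal(size=(200, 10_000))

mean = checkpoints.mean(axis=0)
centered = checkpoints - mean
# Top principal directions of the parameter path via SVD.
_, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
explained = singular_values**2 / np.sum(singular_values**2)
print("variance explained by top 2 directions:", explained[:2].sum())

# The parameter evolving path, viewed in a 2-D subspace.
path_2d = centered @ vt[:2].T
```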
A Algorithm
This section consists of three parts, with each subsequent part building upon the previous one. Appendix A.1 covers the fundamentals of RL, where the actor-critic method is introduced. Appendix A.2 describes the RL algorithm for a single fulfillment agent, which is the proximal policy optimization (PPO) algorithm. Appendix A.3 presents the MARL algorithm for the order fulfillment problem.

Currently, policy-based methods [Deisenroth et al., 2013] are prevalent because they are compatible with stochastic policies. To sum up, the complete procedure is given in Algorithm 1 (Heterogeneous Multi-Agent Reinforcement Learning for Order Fulfillment). With regard to the advantage estimator, we set the GAE parameters following Schulman et al. [2016].

To highlight how our proposed benchmark differs from existing approaches focused on sub-tasks of order fulfillment, we compare the objectives, observations, and actions in Table 1. It should be noted that multiple formulations exist for each sub-task.
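The advantage estimator referenced above is GAE [Schulman et al., 2016]. Since the extracted text elides the actual parameter settings, the following is only a minimal self-contained sketch of the estimator, with illustrative gamma and lam defaults rather than the appendix's values.

```python
import numpy as np

def gae_advantages(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation (Schulman et al., 2016).

    rewards: shape (T,)   -- per-step rewards
    values:  shape (T+1,) -- value estimates, including the bootstrap value
    dones:   shape (T,)   -- 1.0 where an episode terminated
    gamma, lam: discount and GAE smoothing parameters (illustrative values)
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    # Backward recursion: A_t = delta_t + gamma * lam * A_{t+1} on non-terminal steps.
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        gae = delta + gamma * lam * nonterminal * gae
        advantages[t] = gae
    # Return advantages and the corresponding value-function targets.
    return advantages, advantages + values[:-1]
```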
Learning Distinguishable Trajectory Representation with Contrastive Loss
Policy network parameter sharing is a commonly used technique in advanced deep multi-agent reinforcement learning (MARL) algorithms: it improves learning efficiency by reducing the number of policy parameters and sharing experience among agents. Nevertheless, agents that share policy parameters tend to learn similar behaviors. To encourage multi-agent diversity, prior works typically maximize the mutual information between trajectories and agent identities using variational inference. However, this category of methods easily leads to inefficient exploration, since only a limited set of trajectories is ever visited. To resolve this limitation, and inspired by how pre-trained models are learned, we propose a novel Contrastive Trajectory Representation (CTR) method that learns distinguishable trajectory representations to encourage multi-agent diversity.
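As a concrete illustration of a contrastive objective over trajectory embeddings, the PyTorch sketch below implements a generic InfoNCE loss; the function name, tensor shapes, and temperature are our assumptions, not the paper's exact CTR loss.

```python
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE contrastive loss over trajectory embeddings.

    anchor, positive: (B, D) embeddings of trajectories from the same agent
    negatives:        (B, K, D) embeddings of trajectories from other agents
    Illustrative sketch of a contrastive objective, not the paper's CTR loss.
    """
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)

    # Similarity of each anchor to its positive and to its K negatives.
    pos_logits = (anchor * positive).sum(-1, keepdim=True) / temperature      # (B, 1)
    neg_logits = torch.einsum("bd,bkd->bk", anchor, negatives) / temperature  # (B, K)
    logits = torch.cat([pos_logits, neg_logits], dim=1)                       # (B, 1+K)
    labels = torch.zeros(len(anchor), dtype=torch.long)  # positive sits at index 0
    return F.cross_entropy(logits, labels)
```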
Connected Superlevel Set in (Deep) Reinforcement Learning and its Application to Minimax Theorems
The aim of this paper is to improve the understanding of the optimization landscape for policy optimization problems in reinforcement learning. Specifically, we show that the superlevel set of the objective function with respect to the policy parameter is always a connected set, both in the tabular setting and under policies represented by a class of neural networks. In addition, we show that the optimization objective as a function of the policy parameter and reward satisfies a stronger "equiconnectedness" property. To the best of our knowledge, these are novel and previously unknown discoveries. We present an application of the connectedness of these superlevel sets to the derivation of minimax theorems for robust reinforcement learning. We show that any minimax optimization program which is convex on one side and equiconnected on the other side satisfies the minimax equality (i.e., has a Nash equilibrium). We find that exactly this structure is exhibited by an interesting class of robust reinforcement learning problems under an adversarial reward attack, and the validity of its minimax equality immediately follows. This is the first time such a result has been established in the literature.
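For concreteness, the superlevel set in question can be written down as follows; the notation ($\mathcal{U}_\lambda$, $J$, $\Theta$) is ours and not necessarily the paper's.

```latex
% Superlevel set of the return J at level \lambda, over policy parameters \theta:
\[
  \mathcal{U}_\lambda \;=\; \{\, \theta \in \Theta \;:\; J(\theta) \ge \lambda \,\},
  \qquad \lambda \in \mathbb{R}.
\]
% One convenient (path-connected) form of the claimed connectedness: for any
% \theta_0, \theta_1 \in \mathcal{U}_\lambda there exists a continuous path
% h : [0,1] \to \Theta with h(0) = \theta_0, h(1) = \theta_1, and
% J(h(t)) \ge \lambda for all t \in [0,1].
```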
Distributed scalable coupled policy algorithm for networked multi-agent reinforcement learning
Dai, Pengcheng, Wang, Dongming, Yu, Wenwu, Ren, Wei
This paper studies networked multi-agent reinforcement learning (NMARL) with interdependent rewards and coupled policies. In this setting, each agent's reward depends on its own state-action pair as well as those of its direct neighbors, and each agent's policy is parameterized by its local parameters together with those of its $\kappa_p$-hop neighbors, with $\kappa_p \geq 1$ denoting the coupling radius. The objective of the agents is to collaboratively optimize their policies to maximize the discounted average cumulative reward. To address the challenge of interdependent policies in collaborative optimization, we introduce a novel concept termed the neighbors' averaged $Q$-function and derive a new expression for the coupled policy gradient. Based on these theoretical foundations, we develop a distributed scalable coupled policy (DSCP) algorithm, in which each agent relies only on the state-action pairs of its $\kappa_p$-hop neighbors and the rewards of its $(\kappa_p+1)$-hop neighbors. Specifically, the DSCP algorithm employs a geometric 2-horizon sampling method, which obtains an unbiased estimate of the coupled policy gradient without storing a full $Q$-table. Moreover, each agent interacts exclusively with its direct neighbors to obtain accurate policy parameters, while maintaining local estimates of the other agents' parameters in order to execute its local policy and collect samples for optimization. These estimates and policy parameters are updated via a push-sum protocol, enabling distributed coordination of policy updates across the network. We prove that the joint policy produced by the proposed algorithm converges to a first-order stationary point of the objective function. Finally, the effectiveness of the DSCP algorithm is demonstrated through simulations in a robot path planning environment, showing clear improvement over state-of-the-art methods.
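To make the push-sum step concrete, here is a minimal NumPy sketch of the generic push-sum averaging protocol on a directed network. The setup (a column-stochastic mixing matrix and scalar estimates) is illustrative and much simpler than the actual parameter exchange in DSCP.

```python
import numpy as np

def push_sum_round(x, w, P):
    """One round of the push-sum protocol on a directed network.

    x: (n,) local values (e.g., each agent's estimate of a policy parameter)
    w: (n,) push-sum weights, initialized to 1
    P: (n, n) column-stochastic mixing matrix; P[i, j] > 0 only if agent j
       sends to agent i. Names and setup are illustrative, not DSCP itself.
    On a strongly connected graph, x / w converges to the average of the
    initial values of x.
    """
    return P @ x, P @ w

# Tiny example: 3 agents on a directed ring (columns of P sum to 1).
P = np.array([[0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5]])
x = np.array([1.0, 2.0, 6.0])  # initial local estimates (average = 3)
w = np.ones(3)
for _ in range(50):
    x, w = push_sum_round(x, w, P)
print(x / w)  # each entry is approximately 3.0
```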
Hyper-GoalNet: Goal-Conditioned Manipulation Policy Learning with HyperNetworks
Zhou, Pei, Yao, Wanting, Luo, Qian, Zhou, Xunzhe, Yang, Yanchao
Goal-conditioned policy learning for robotic manipulation presents significant challenges in maintaining performance across diverse objectives and environments. We introduce Hyper-GoalNet, a framework that generates task-specific policy network parameters from goal specifications using hypernetworks. Unlike conventional methods that simply condition fixed networks on goal-state pairs, our approach separates goal interpretation from state processing -- the former determines network parameters while the latter applies these parameters to current observations. To enhance representation quality for effective policy generation, we implement two complementary constraints on the latent space: (1) a forward dynamics model that promotes state transition predictability, and (2) a distance-based constraint ensuring monotonic progression toward goal states. We evaluate our method on a comprehensive suite of manipulation tasks with varying environmental randomization. Results demonstrate significant performance improvements over state-of-the-art methods, particularly in high-variability conditions. Real-world robotic experiments further validate our method's robustness to sensor noise and physical uncertainties. Code is available at: https://github.com/wantingyao/hyper-goalnet.
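As a rough sketch of the hypernetwork idea described above (the goal determines the network parameters, the state is processed by them), the PyTorch snippet below generates the weights of a small linear policy head from a goal embedding. All names and dimensions are our assumptions, not the released Hyper-GoalNet code.

```python
import torch
import torch.nn as nn

class GoalHyperPolicy(nn.Module):
    """Illustrative hypernetwork sketch (not the released Hyper-GoalNet code):
    a goal embedding generates the weights of a linear policy head, which is
    then applied to the current state observation."""

    def __init__(self, goal_dim=16, state_dim=32, action_dim=4):
        super().__init__()
        self.state_dim, self.action_dim = state_dim, action_dim
        # Hypernetwork: maps the goal to (weights + bias) of the policy head.
        self.hyper = nn.Sequential(
            nn.Linear(goal_dim, 128), nn.ReLU(),
            nn.Linear(128, state_dim * action_dim + action_dim),
        )

    def forward(self, state, goal):
        params = self.hyper(goal)                         # (B, S*A + A)
        W = params[:, : self.state_dim * self.action_dim]
        W = W.view(-1, self.action_dim, self.state_dim)   # (B, A, S)
        b = params[:, self.state_dim * self.action_dim :] # (B, A)
        # Apply the goal-generated head to the current state, per batch item.
        return torch.einsum("bas,bs->ba", W, state) + b   # (B, A)

policy = GoalHyperPolicy()
actions = policy(torch.randn(8, 32), torch.randn(8, 16))
print(actions.shape)  # torch.Size([8, 4])
```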